ABSTRACT

Biomedical researchers share a common challenge of 
making complex data understandable and accessible as 
they seek inherent relationships between attributes in 
disparate data types. Data discovery in this context is 
limited by a lack of query systems that efficiently show 
relationships between individual variables, but without 
the need to navigate underlying data models. We have 
addressed this need by developing Harvest, an open- 
source framework of modular components, and using it 
for the rapid development and deployment of custom 
data discovery software applications. Harvest 
incorporates visualizations of highly dimensional data in 
a web-based interface that promotes rapid exploration 
and export of any type of biomedical information, 
without exposing researchers to underlying data models. 
We evaluated Harvest with two cases: clinical data from 
pediatric cardiology and demonstration data from the 
OpenMRS project. Harvest's architecture and public 
open-source code offer a set of rapid application 
development tools to build data discovery applications 
for domain-specific biomedical data repositories. All 
resources, including the OpenMRS demonstration, can be 
found at http://harvest.research.chop.edu 



INTRODUCTION 

Biomedical researchers are often challenged with
navigating the large volumes of data available from
medical and research information systems. Datasets
useful to biomedical research are typically complex,
highly dimensional, and temporal, often with significant
variation in granularity, sparsity, and representation
across data dimensions. Research data
complexity is amplified by the high volume of data
points generated by modern molecular and imaging
platforms. Unlike purely transactional data, such
as those typically derived from business operations,
these data are not readily summed or averaged, limiting
the utility of traditional business intelligence
tools in this context. Moreover, existing query and
reporting tools tend to be general-purpose instruments
with user interfaces designed to support
expert analysts working in a variety of situations.
As a result, researchers without access to sophisticated
informatics expertise are increasingly challenged
with efficiently managing, exploring, and
understanding the information at their disposal.

Accordingly, we developed Harvest, a new biomedical
data application framework. Our primary
development objectives were to (1) provide
researchers who have limited informatics ability with
a toolkit to generate meaningful views of raw data according
to their domain expertise and their specific interests;
(2) dynamically query key aspects of a dataset based
on the inherent characteristics of individual data
attributes; (3) combine single attribute queries into
multiattribute set operation queries; and (4) provide
an actionable endpoint by exporting immediately
available raw data in an analysis-ready format. To
demonstrate its effectiveness, Harvest was used to
develop and deploy intuitive data discovery applications
for two distinctive biomedical domains: pediatric
cardiology diagnostic modality and procedure
data generated at The Children's Hospital of
Philadelphia (CHOP), and infectious disease data
published by the OpenMRS open-source electronic
health record (EHR) project.

BACKGROUND AND SIGNIFICANCE 

Adoption of EHR systems by academic medical 
centers has created significant potential for the 
re-use of clinical data for research. Trends in translational
research indicate a need for data discovery
platforms that can stage and disseminate data in a 
readily accessible form to researchers focusing on 
disease. Such trends include the increasing adoption 
and diversity of EHRs to capture longitudinal 
patient information; rapid development and early 
adoption of genomics, imaging, and other complex 
data types; the progressive organization of multidis- 
ciplinary teams focusing on systems biology 
research; and the need for integration and exchange 
of many different types and complexities of data. 
We sought a solution that would enable rapid iteration
of design ideas and provide the flexibility to
adapt to the accelerated pace of innovation in 
diverse research settings. 

Many available analytic applications operate on
arbitrary datasets, including SAP Business Objects,
IBM Cognos, QlikTech QlikView, and TIBCO
Spotfire. Many of these tools are specific for business
intelligence (BI) and perform well on quantitative
transactional data aggregated across multiple
dimensions. However, commercial BI tools have
not yet been widely adopted in the research community,
owing in part to the limitations of these
tools for analyzing highly multidimensional data
arising from discrete observations. For instance,
aggregation of observations by counting is of little
value to researchers when the data are categorical,
and of marginal use for non-epidemiological
patient-oriented research when the data are quantitative.
Both QlikView and Spotfire allow for these
features but are limited by their architectures,
which rely on in-memory data stores that present
performance barriers for Big Data applications.
Moreover, the cost of licensing these tools for
open-access internal and external academic use can
be prohibitive.

The variability of research data from project to project requires
either a monolithic application data model, which optimizes
efficiency from application development and operational
standpoints, or custom code developed for every application. One
strategy for a monolithic database schema is to use an
entity-attribute-value (EAV) model. An EAV model uses key-value fields
in a general-purpose table to extensibly store arbitrary data
without having to create dedicated tables, as exemplified in
biomedicine by the i2b2 (informatics for integrating biology and the
bedside) project. An EAV approach provides the convenience of
a fixed database model, but at the expense of the benefits
provided by normalized relational models such as database-level
referential integrity and performance optimization through
indexing. While successful in many domains, EAV models have
difficulty supporting ad hoc, attribute-centric queries on highly
dimensional data, such as clinical and annotated genomic data.
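
To make this trade-off concrete, the following is a minimal sketch (not taken from i2b2 or Harvest) that stores the same three patients in a generic EAV table and in a dedicated table, using Python's built-in sqlite3 module. The table names, column names, and values are hypothetical. It illustrates why an attribute-centric question that combines two attributes is a simple filter in the normalized form but needs one self-join per attribute, plus a cast from text, in the EAV form.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- EAV: one generic table holds every attribute as a key-value row.
    CREATE TABLE eav (entity_id INTEGER, attribute TEXT, value TEXT);
    -- Normalized: one typed column per attribute.
    CREATE TABLE patient (id INTEGER PRIMARY KEY, sex TEXT, age_years REAL);
""")

# Hypothetical demonstration rows: (id, sex, age in years).
patients = [(1, "F", 4.5), (2, "M", 12.0), (3, "F", 15.0)]
conn.executemany("INSERT INTO patient VALUES (?, ?, ?)", patients)
for pid, sex, age in patients:
    conn.execute("INSERT INTO eav VALUES (?, 'sex', ?)", (pid, sex))
    conn.execute("INSERT INTO eav VALUES (?, 'age_years', ?)", (pid, str(age)))

# Attribute-centric question: female patients younger than 10 years.
normalized_ids = [row[0] for row in conn.execute(
    "SELECT id FROM patient WHERE sex = 'F' AND age_years < 10 ORDER BY id")]

# The EAV form needs one self-join per attribute and a cast on the text value.
eav_ids = [row[0] for row in conn.execute("""
    SELECT s.entity_id
    FROM eav AS s
    JOIN eav AS a ON a.entity_id = s.entity_id
    WHERE s.attribute = 'sex' AND s.value = 'F'
      AND a.attribute = 'age_years' AND CAST(a.value AS REAL) < 10
    ORDER BY s.entity_id""")]
```

Both queries select the same patient, but each additional attribute in the EAV query adds another join, and the database cannot enforce types or foreign keys on the generic value column.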

Alternatively, custom code built on an application-specific database
model makes sense when the application requirements are
sufficiently unique for a generic solution to be impracticable. Our
experience with numerous collaborators indicated commonality in
many functional requirements among projects, namely a capacity
to iteratively browse, search, query, review, and export
project-specific data, such that a certain level of generic functionality (and
code) was clearly feasible. However, requirements differed in the
data model, where we were faced with a wide variety of data types
among projects. Based on the success of the BI tools discussed
above, we reasoned that an application data access layer might
solve the problem of generic access to highly variable database
schemas. We hypothesized that a hybrid solution might handle an
arbitrary schema with minimal to no custom code in the data
access layer, while simultaneously supporting rapid development
of data discovery applications customized to the domain of
interest. Accordingly, we developed the Harvest framework both for
our own use and dissemination to others.

MATERIALS, METHODS, AND TECHNOLOGIES 
Harvest core components and implementation 

Harvest comprises three main components: a data abstraction
layer (Avocado), a web API (Serrano), and a web client
(Cilantro). Avocado and Serrano are implemented using
Python and the Django web application framework.
Cilantro is implemented using JavaScript. Detailed documentation
of each component is available online.

Avocado 

The core of Harvest is a data abstraction component termed
'Avocado'. Avocado extends the Django object-relational mapper
to provide a stable and contextual application programming interface
(API) for client data access. Harvest applications rely on
Avocado to generate and manage application metadata, especially
automatically generated data profiles that reveal and publish inherent
characteristics of the raw data, such as determination of
categorical versus continuous type, and indexing of text data for
subsequent search. Avocado also supports additional metadata
management, including providing an alias for data fields with
human-readable concept names, a searchable thesaurus of
synonym keywords describing concepts, and domain-specific
categories for grouping concepts within a client user interface.
Avocado currently authorizes access to both data rows and concepts
through optional integration with Django Guardian.
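
As an illustration of what an automatically generated data profile might contain, the sketch below classifies a column of raw values as categorical or continuous and summarizes it. This is not Avocado's code or API: the function name `profile_column`, the `max_categories` threshold, and the rule itself are assumptions chosen for clarity.

```python
from collections import Counter


def profile_column(values, max_categories=10):
    """Summarize a column and classify it as 'categorical' or 'continuous'.

    Heuristic (an assumption, not Avocado's rule): a column is continuous
    when every non-null value is numeric and there are more distinct
    values than max_categories.
    """
    present = [v for v in values if v is not None]
    nulls = len(values) - len(present)
    numeric = bool(present) and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
    )
    if numeric and len(set(present)) > max_categories:
        return {"type": "continuous", "min": min(present),
                "max": max(present), "nulls": nulls}
    # Few distinct values, or non-numeric data: report per-value counts.
    return {"type": "categorical", "counts": dict(Counter(present)),
            "nulls": nulls}


# Hypothetical raw values for two fields.
vessel = profile_column(["Artery", "Vein", "Artery", None])
weights = profile_column(
    [3.2, 4.1, 5.0, 7.7, 9.3, 10.4, 12.8, 15.1, 18.9, 22.0, 30.5])
```

A client could use the `type` entry to choose a default display, for example a bar chart for the categorical counts and a histogram over the continuous range.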

Serrano server 

Harvest's Serrano server publishes a hypermedia API that
enables web clients to consume data encapsulated by Avocado
and saves user data for Harvest applications. This is accomplished
by storing a representation of user actions into a history
table in the application database. This supports several key features,
including the ability for users to name and save queries,
and provides an auditable record of user actions.
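
The history-table idea can be sketched as follows. This is a simplified illustration using Python's built-in sqlite3 and json modules, not Serrano's actual schema: the table name `query_history`, its columns, and the JSON shape of a condition are all assumptions. Every action is appended as a row, and a query counts as "saved" once it carries a name.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE query_history (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        username TEXT NOT NULL,
        name TEXT,               -- NULL until the user names the query
        conditions TEXT NOT NULL -- JSON representation of the query
    )""")


def record_query(username, conditions, name=None):
    """Append one user action to the history table and return its row id."""
    cur = conn.execute(
        "INSERT INTO query_history (username, name, conditions) "
        "VALUES (?, ?, ?)",
        (username, name, json.dumps(conditions, sort_keys=True)))
    return cur.lastrowid


def saved_queries(username):
    """Return the named (saved) queries for a user, oldest first."""
    rows = conn.execute(
        "SELECT name, conditions FROM query_history "
        "WHERE username = ? AND name IS NOT NULL ORDER BY id", (username,))
    return [(name, json.loads(conditions)) for name, conditions in rows]


# Hypothetical condition and user.
condition = {"field": "access_vessel_type", "operator": "exact",
             "value": "Artery"}
record_query("alice", condition)                          # audit trail only
record_query("alice", condition, name="Arterial access")  # saved query
audit_rows = conn.execute("SELECT COUNT(*) FROM query_history").fetchone()[0]
```

Because rows are only ever appended, the same table serves both purposes described above: the full set of rows is the audit record, and the named subset is the user's list of saved queries.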

Cilantro client 

The Cilantro JavaScript client generates and displays intuitive
data visualizations such as histograms, bar charts, and pie charts
for suitable database fields in real time using HighCharts JS.
Importantly, this feature allows users to see a summary profile
of data even before constructing formal queries. In some
instances, users interact directly with the graphical displays to
construct their queries, for example by clicking on a bar of
interest in a graph that represents a subpopulation (figure 1A).
The architecture of the client also allows for specialized query
controls to be integrated when simple data-driven displays are
not appropriate. For example, a diagnosis field might be presented
as a hierarchical view that allows a user to browse,
search, and select one or more diagnoses of interest while
constructing complex set operation queries (figure 1B).
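
The logic behind such a set operation query can be sketched with Python sets. The example assumes three buckets ("at least one", "require all", "exclude any"), each holding the set of patient identifiers that match one selected diagnosis. The function name `combine`, the bucket semantics, and the patient identifiers are illustrative assumptions, not Cilantro's implementation, which runs in the browser and delegates the query to the server.

```python
def combine(at_least_one, require_all, exclude_any, all_patients):
    """Combine per-diagnosis patient-ID sets into one result set.

    Each bucket is a list of sets; an empty bucket adds no constraint.
    """
    result = set(all_patients)
    if at_least_one:
        result &= set().union(*at_least_one)  # OR across the bucket
    for ids in require_all:
        result &= ids                         # AND across the bucket
    for ids in exclude_any:
        result -= ids                         # NOT for each member
    return result


# Hypothetical patient identifiers for three diagnoses.
tetralogy_of_fallot = {1, 2, 3, 5}
septic_pulmonary_embolism = {2}
chronic_pulmonary_embolism = {5, 6}
everyone = {1, 2, 3, 4, 5, 6}

matches = combine(
    at_least_one=[tetralogy_of_fallot],
    require_all=[],
    exclude_any=[septic_pulmonary_embolism, chronic_pulmonary_embolism],
    all_patients=everyone,
)
```

Here `matches` contains the patients with tetralogy of Fallot who have neither form of pulmonary embolism. In a relational database the same combination would typically be expressed with UNION, INTERSECT, and EXCEPT or equivalent subqueries.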

Technologies 

Harvest implements a three-tiered application architecture using
a relational database management system, web application
server, and web browser client. All server components are built
using Python and the Django web framework. Harvest configurations
to date have used the open-source PostgreSQL and
SQLite databases, and the Nginx and Apache web servers.
However, Harvest is readily compatible with other database and
web server technologies.
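
Because the server components are built on Django, the database backend is selected in the project settings rather than in application code. The fragment below is a hedged sketch of that idea: the database names, host, and port are placeholders, and the engine paths are those used by current Django releases (older releases used a different PostgreSQL engine path), so it should be checked against the Django version actually deployed.

```python
# Hypothetical settings fragment; database names and paths are placeholders.
POSTGRESQL = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": "harvest_demo",
        "HOST": "localhost",
        "PORT": "5432",
    }
}

SQLITE = {
    "default": {
        "ENGINE": "django.db.backends.sqlite3",
        "NAME": "harvest_demo.sqlite3",
    }
}

# Django reads the database configuration from a setting named DATABASES;
# switching backends means pointing it at a different dictionary.
DATABASES = SQLITE
```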

RESULTS 

Harvest deployments 

Harvest has been used to develop several biomedical data discovery
applications spanning a variety of diseases and data
types, two of which are described here. Typically, specific applications
are built on patient-oriented databases encompassing
unique, domain-specific relational schemas. Each application
also employs a unique configuration of Avocado metadata that
provides researchers with a user interface incorporating data
heuristics, terminology, and an organization scheme specific to
the area of study.

Harvest instance: CardioDB 

CardioDB (figure 1A) is a quality improvement data resource for
measuring outcomes in the Cardiac Center at CHOP using clinical
data derived primarily from cardiac catheterization and
echocardiogram systems. The database and application include 323 fields
for 47 300 patients across 24 900 catheterization procedures and
54 000 echocardiogram procedures, refreshed nightly. Data types
include problem list, diagnosis vocabulary, procedure vocabulary,
radiation exposure, cardiac dimensions, and cardiac output (see
online supplementary figure S1). Diagnosis and procedure vocabularies
are implemented in a hierarchical data model populated
with standard ICD9 and Current Procedural Terminology codes
along with CHOP-customized codes. The query interface to these
vocabularies is implemented in a generic Cilantro module called
the Harvest Vocab Browser. CardioDB enables discovery and
reporting of clinical outcomes by providing clinicians and
researchers with a longitudinal view of discrete, granular measures
of patient cardiac function both before and after a catheterization
procedure, together with relevant data describing both the patient
and procedures. Sensitive measurement of outcomes is supported

Figure 1 (A) Categorical data by
default are displayed as bar or pie
charts. Users click on chart elements of
interest to select and add to the list of
query conditions. (B) Custom controls
may be developed to handle complex
data types and query operations, such
as this vocabulary browser that
displays ICD9 diagnoses in a
browseable and searchable hierarchy,
together with input fields that enable
element drag-and-drop supported
construction of complex set-operation
query conditions. Views displayed
originate from the CardioDB Harvest
application.



[Figure 1 screenshots. (A) "Search Available Conditions" panel with the
"Access Vessel Type" field selected and the condition "Access Vessel Type
is equal to Artery" added. (B) ICD9 diagnosis hierarchy with "Selected
Diagnoses" buckets labeled "At Least One", "Require All", and "Exclude
Any Of These".]



by use of Harvest's query interface to stratify CHOP's highly vari- 
able cardiac patient population. CardioDB is used by researchers 
and clinicians at CHOP with interest in abnormal cardiac anatomy 
and physiology. 



Harvest instance: OpenMRS 

We validated Harvest with a public dataset originating outside
of CHOP, for which we have no specific interest or motivation,
by developing a demonstration application (figure 2) using a
deidentified clinical dataset published by the OpenMRS open
source EHR project. OpenMRS data types include infection
status, disease management, and clinical laboratory results (see
online supplementary figure S2). The application includes 56
data fields on 5300 patients and is freely available online
through the Harvest website for direct exploration, and as an
installable package. The extract, transform, and load (ETL)
package for the Harvest OpenMRS application is also available
at the Harvest website.



In the course of building these applications, we have found
that Harvest separates design, coding, and implementation from
the definition and organization of data elements by domain
experts. We have also found that the framework enables rapid
prototype development, as exemplified by the creation of the
OpenMRS demonstration in just two days, excluding database
creation and ETL. The Cilantro user interface and Avocado
metadata promote intuitive and meaningful review that enables
productive feedback and iterative development loops.

DISCUSSION 

Our results indicate that Harvest facilitates construction of
accessible biomedical data discovery applications by providing
informatics researchers with open, standards-based components,
an adaptable framework for defining domain-specific data concepts,
and a user interface design that makes large and complex
datasets accessible. We have used Harvest to support data discovery
applications of highly multidimensional data across a
variety of clinical and research domains. The two applications

Figure 2 The query construction
view is used to preview data, such as
the distribution of white blood cell
count, while building up query
conditions that are displayed in a
readable format. View originates from
the OpenMRS Harvest application.

we describe signify generalizability as they share a common
patient-encounter root data model but otherwise diverge in 
their detailed dimensions. Specifically, the CardioDB data model 
invokes procedural diagnostic, intervention, and cardiac func- 
tion data, whereas the OpenMRS data model focuses on infec- 
tion status, disease management, and clinical laboratory results. 
Harvest can support a variety of data discovery paradigms, 
including hypothesis generation and testing, clinical outcomes 
reporting, and multisite access to shared research data. 

Frameworks such as Harvest provide researchers and data 
reporting groups with a toolkit suitable for addressing a primary 
need of biomedical research: the ability to be rapidly and effect- 
ively immersed in interoperable data relevant to a study of inter- 
est, and in a way in which the data are readily comprehensible. 
In our experience, researchers have gained trust with specific 
Harvest applications, in part because the immediate presentation 
of data visualizations provides transparency of data content. 
Data are presented as collected, with outliers and potential data 
inconsistencies clearly observable even before a researcher is 
required to execute a query. Often, these visualizations are the 
first time researchers have seen variability and potential quality 
problems in their data. This incentivizes users to participate in 
iterative improvement, thus generating process investment and 
subsequent increased scientific value from the data. 

A decided strength of Harvest is the ease of reuse in a variety of 
biomedical domains. For example, we are currently developing 
Harvest instances to support medical imaging studies, multisite 
integration of clinical and molecular data, and genomic variant 
sets derived from next generation sequencing. These applications 
can completely reuse the Avocado and Serrano components, which 
greatly accelerates the development of a new client interface. 

All Harvest deployments thus far have resulted in rapid adop- 
tion by biomedical researchers, requiring little user interface 
training. The domain-specific conceptual and visual representa- 
tion of data enabled by Harvest seems to lower the cognitive 
burden on researchers to learn and understand an application's 
underlying data model before data discovery. This contrasts with 
our experience with commercial BI tools, which typically
require an expert user to navigate both underlying data models
and the extensive functionality implemented by these enterprise 
tools. 

Harvest provides basic authorization capabilities, enabling con- 
trolled access both to data rows and concepts/fields at the user or 
user group level. Further development of this authorization cap- 
ability to increase usability and transparency of the authorization 
scheme is warranted. A means to authorize access to application 
functionality, such as data export or drill-to-detail, is a promising 
area for development. 

Importantly, Harvest itself does not deal with the need to 
transform and integrate biomedical data from original source 
systems and formats into well-structured data suitable for query 
and analysis. We address this need through various ETL 
methods and tools. We have found that the rapid application 
development capability of Harvest accelerates the ETL develop- 
ment process, especially when using the Avocado data API to 
evaluate staged and partially integrated data. 

To date, Harvest has been primarily used in focused,
project-specific applications, where the analysis and configuration
required to take advantage of Avocado's conceptual representa- 
tion of data were feasible and justified by the expected end use 
of the application. We have yet to test the utility of Harvest in a 
more typical enterprise data warehouse scenario, where a poten- 
tially large number of data fields might impose challenges of 
scalability and required effort. Furthermore, while Harvest has 
been able to model and present highly complex data types, we 
have not yet fully dealt with the more difficult task of modeling 
longitudinal data in a manner that would enable construction of 
temporal query constraints. 

CONCLUSION 

Harvest promotes immediacy in data exploration and use in 
data-intensive clinical and translational science. For biomedical 
researchers, Harvest-based applications provide an accessible,
easy-to-understand view of complex data, presented in a concep- 
tual framework that they help develop, and with focus on data 
content rather than application development. The Harvest platform 
and the OpenMRS demonstration application are available as open
source under the unrestricted BSD license agreement. The demonstration
application, system documentation, tutorials, and ETL
package may be found at http://harvest.research.chop.edu.